ABSTRACT

Convex optimization is an essential tool for modern data analysis,
as it provides a framework to formulate and solve many problems
in machine learning and data mining. However, general convex
optimization solvers do not scale well, and scalable solvers are often
specialized to only work on a narrow class of problems. Therefore,
there is a need for simple, scalable algorithms that can solve
many common optimization problems. In this paper, we introduce
the network lasso, a generalization of the group lasso to a network
setting that allows for simultaneous clustering and optimization on
graphs. We develop an algorithm based on the Alternating Direction
Method of Multipliers (ADMM) to solve this problem in a
distributed and scalable manner, which allows for guaranteed global
convergence even on large graphs. We also examine a non-convex
extension of this approach. We then demonstrate that many types
of problems can be expressed in our framework. We focus on three
in particular (binary classification, predicting housing prices, and
event detection in time series data), comparing the network lasso
to baseline approaches and showing that it is both a fast and
accurate method of solving large optimization problems.

Categories and Subject Descriptors: H.2.8 [Database Management]:
Database applications - Data mining
General Terms: Algorithms; Experimentation. 

Keywords: Convex Optimization, ADMM, Network Lasso. 

1. INTRODUCTION 

Convex optimization has become an increasingly popular way of
modeling problems in many different fields, ranging from finance
[§4.4] to image processing. However, as datasets get larger
and more intricate, classical methods of convex analysis, which
often rely on interior point methods, begin to fail due to a lack of
scalability. In fact, without any known structure to the
optimization problem, the convergence time will scale with the cube of the
problem size. The challenge of large-scale optimization lies in
developing methods general enough to work well independent of
the input and capable of scaling to the immense datasets that
today's applications require. Presently, solving these problems in a
scalable way requires developing problem-specific solvers that
exploit structure in the model [27], which is often infeasible.
Therefore, it is necessary to formulate general classes of
optimization solvers that can apply to a variety of relevant problems, and to
develop algorithms for obtaining reliable and efficient solutions.

Present Work: Formulation. Here, we focus on optimization
problems posed on graphs. Consider the following problem on a
graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the vertex set and $\mathcal{E}$ the set of edges:

$$\text{minimize} \quad \sum_{i \in \mathcal{V}} f_i(x_i) + \sum_{(j,k) \in \mathcal{E}} g_{jk}(x_j, x_k). \tag{1}$$

The variables are $x_1, \ldots, x_m \in \mathbb{R}^p$, where $m = |\mathcal{V}|$. (The total
number of scalar variables is $mp$.) Here $x_i \in \mathbb{R}^p$ is the variable
at node $i$, $f_i : \mathbb{R}^p \to \mathbb{R} \cup \{\infty\}$ is the cost function at node $i$, and
$g_{jk} : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} \cup \{\infty\}$ is the cost function associated with edge
$(j, k)$. We use extended (infinite) values of $f_i$ and $g_{jk}$ to describe
constraints on the variables, or pairs of variables across an edge,
respectively. Our focus will be on the special case in which the $f_i$
are convex, and $g_{jk}(x_j, x_k) = \lambda w_{jk} \|x_j - x_k\|_2$, with $\lambda \geq 0$ and
user-defined weights $w_{jk} \geq 0$:

$$\text{minimize} \quad \sum_{i \in \mathcal{V}} f_i(x_i) + \lambda \sum_{(j,k) \in \mathcal{E}} w_{jk} \|x_j - x_k\|_2. \tag{2}$$

The edge objectives penalize differences between the variables at
adjacent nodes, where the edge between nodes $i$ and $j$ has weight
$\lambda w_{ij}$. Here we can think of $w_{ij}$ as setting the relative weights
among the edges of the network, and $\lambda$ as an overall parameter that
scales the edge objectives relative to the node objectives. We call
problem (2) the network lasso problem, since the edge cost is a sum
of norms of differences of the adjacent edge variables.
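To make the objective concrete, the following minimal sketch evaluates the network lasso objective for a given assignment of node variables. It assumes, purely for illustration, that each node cost is the squared loss $f_i(x_i) = \frac{1}{2}\|x_i - a_i\|_2^2$; the function name `network_lasso_objective` and the edge-list layout are hypothetical and not taken from the paper.

```python
import math

def network_lasso_objective(x, a, edges, lam):
    """Evaluate sum_i f_i(x_i) + lam * sum_(j,k) w_jk * ||x_j - x_k||_2.

    Assumption for illustration: f_i(x_i) = 0.5 * ||x_i - a_i||_2^2.
    x, a: lists of equal-length vectors (lists of floats), one per node.
    edges: list of (j, k, w_jk) triples with w_jk >= 0.
    """
    node_cost = sum(
        0.5 * sum((xc - ac) ** 2 for xc, ac in zip(xi, ai))
        for xi, ai in zip(x, a)
    )
    edge_cost = sum(
        w * math.sqrt(sum((xjc - xkc) ** 2 for xjc, xkc in zip(x[j], x[k])))
        for j, k, w in edges
    )
    return node_cost + lam * edge_cost

# Three nodes on a path; nodes 1 and 2 observe the same value.
a = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
edges = [(0, 1, 1.0), (1, 2, 2.0)]

# Every node keeps its own observation: only the edge (0, 1) is penalized.
separate = network_lasso_objective(a, a, edges, lam=0.5)
# Every node adopts [3, 4]: no edge penalty, but node 0 pays a node cost.
merged = network_lasso_objective([[3.0, 4.0]] * 3, a, edges, lam=0.5)
```

The edge between nodes 1 and 2 costs nothing in either assignment because their variables agree, which is the behavior the sum-of-norms penalty rewards.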

The network lasso problem is a convex optimization problem,
and so in principle it can be solved efficiently. For small networks,
generic (centralized) convex optimization methods can be used to
solve it. But we are interested in problems with many variables,
with $p$, $m = |\mathcal{V}|$, and $n = |\mathcal{E}|$ all potentially large. For such
problems no adequate solver currently exists. Thus, we develop a
distributed and scalable method for solving the network lasso
problem, in which each vertex variable $x_i$ is controlled by one "agent",
and the agents exchange (small) messages over the graph to solve
the problem iteratively. This approach provides global convergence
for all problems that can be put into this form. We also analyze a
non-convex extension of the network lasso, a slightly different way
to model the problem, and give a similar algorithm that, although it
does not guarantee optimality, tends to perform well in practice.

Present Work: Applications. There are many general settings in
which the network lasso problem arises. In control systems, the
nodes might represent the possible states of a system, and $x_i$ the
action or actions to take when we are in state $i$, so the collection
of variables $(x_1, \ldots, x_m)$ describes a policy. The graph tells us
about state transitions, and the weights express how much we care
about the actions in neighboring states differing. Here the network
lasso problem seeks a solution that minimizes the total cost, but
also does not change much across adjacent states, allowing for a
"simpler" policy. The parameter $\lambda$ allows us to trade off the total
cost (the node objective) versus a cost for the actions varying across
the states (the edge objective).

Another general setting, one we focus on in this paper, relates
to statistical learning, where the variables $x_i$ are parameters in a
statistical model of some data resident at, or associated with, node
$i$. The objective term $f_i$ represents the loss for the model over the
data, possibly with some regularization added in. The edge terms
are regularization that encourages adjacent nodes to have close (or
the same) model parameters. In this setting, the network expresses
our idea that adjacent nodes should have similar (or the same)
models. We can imagine that this regularization allows us to build
models at each node that borrow strength from the fact that neighboring
nodes should have similar, or even identical, models.

It is critical to note that the edge terms in the network lasso
problem involve the norm, not the norm squared, of the difference. If the
norms were squared, the edge objective would reduce to (weighted)
Laplacian regularization [25]. The sum-of-norms regularization
that we use is like group lasso [28]; it encourages not just $x_i \approx x_j$
for edge $(i,j) \in \mathcal{E}$, but $x_i = x_j$, i.e., consensus across the edge.
Indeed, we will see that there is often a (finite) value of $\lambda$ above
which the solution has $x_1 = \cdots = x_m$, i.e., all the vectors are
in consensus. For smaller values of $\lambda$, the solution of the network
lasso problem breaks into clusters of nodes, with $x_i$ the same across
all nodes in the cluster. In the policy setting, we can think of this
as a combination of state aggregation or clustering, together with
policy design. In the modeling setting, this is a combination of
clustering the data collections and fitting a model to each cluster.
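The difference between the norm and the squared norm can be seen on the smallest possible instance. The sketch below is an illustration under stated assumptions, not an example from the paper: two scalar nodes joined by one unit-weight edge, with node costs $f_i(x) = \frac{1}{2}(x - a_i)^2$. Writing $\delta = x_1 - x_2$ and $\Delta = a_1 - a_2$, the node costs contribute $(\delta - \Delta)^2/4$ once the mean is optimized out, so the norm penalty $\lambda|\delta|$ gives a soft-threshold solution while the squared penalty $\lambda\delta^2$ gives $\delta = \Delta/(1 + 4\lambda)$. The helper name `two_node_gap` is hypothetical.

```python
def two_node_gap(a1, a2, lam, squared=False):
    """Optimal difference x1 - x2 for two scalar nodes joined by one edge.

    Assumes f_i(x) = 0.5 * (x - a_i)^2 and unit edge weight (illustrative).
    squared=False: penalty lam * |x1 - x2|    (network lasso)
    squared=True:  penalty lam * (x1 - x2)^2  (Laplacian-style)
    """
    delta = a1 - a2
    if squared:
        # Stationarity of (d - delta)^2 / 4 + lam * d^2.
        return delta / (1.0 + 4.0 * lam)
    # Soft-thresholding of delta with threshold 2 * lam.
    shrunk = max(abs(delta) - 2.0 * lam, 0.0)
    return shrunk if delta >= 0 else -shrunk

lasso_gap = two_node_gap(0.0, 2.0, lam=1.0)                 # exact consensus
laplacian_gap = two_node_gap(0.0, 2.0, lam=1.0, squared=True)  # gap only shrinks
```

In this toy case the norm penalty reaches exact consensus once $\lambda \geq |\Delta|/2$, whereas the squared penalty only shrinks the gap and never closes it for any finite $\lambda$.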

Present Work: Use Case. As a running example, which we later
analyze in detail, consider the problem of predicting housing prices.
One common approach is linear regression. That is, we learn the
weights of each feature (number of bedrooms, square footage, etc.)
and use these same weights for each house to estimate the price.
However, due to location-based factors such as school district or
distance to a highway, similar houses in different locations can have
drastically different prices. These factors are often unknown a priori
and difficult to quantify, so it is inconvenient to attempt to
incorporate them as features in the regression. Therefore, standard
linear regression will have large errors in price prediction, since it
forces the entire dataset to agree on a single global model. What we
actually want is to cluster the houses into "neighborhoods" which
share a common regression model. First, we build a network where
neighboring houses (nodes) are connected by edges. Then, each
house solves for its own regression model (based on its own
features and price). We use the network lasso penalty to encourage
nearby houses to share the same regression parameters, in essence
helping each house determine which neighborhood it is part of, and
learning relevant information from this group of neighbors to
improve its own prediction. The size and shape of these neighborhoods,
though, are difficult to know beforehand and often depend
on a variety of factors, including the amount of available data. The
network lasso solution empirically determines the neighborhoods,
so that each house can share a common model with houses in its
cluster, without having to agree with the potentially misleading
information from other locations.

Summary of Contributions. The main contributions of this paper 
are as follows: 


• We formally define the network lasso, a specific type of
optimization problem on networks.

• We develop a fast, scalable, and distributed solver for any
problem of this form. This algorithm is also capable of
choosing the right regularization parameter $\lambda$.

• We show that many common and useful problems can be
formulated as an instance of the network lasso.

Related Work. The network lasso can be thought of as a
special case of certain methods (Bayesian inference, general convex
optimization) and a generalization of others (fused lasso [23], total
variation [24, 26]). It occupies a unique point on the trade-off curve
between generality and scalability that, to the best of our
knowledge, has not yet been formally analyzed. Our approach provides
a unified view of a diverse class of optimization problems, but is
still capable of solving large-scale examples. For example,
convex clustering [7, 14, 22], an alternative to the K-means algorithm,
is a well-studied instance of the network lasso. However, convex
clustering requires $f_i$ to be the square loss from some observation
$a_i$, and often assumes a fully connected graph since there is no
prior information about which nodes may be clustered together. In
contrast, generalizing to any shape of network with any convex
objectives (including allowing constraints) allows our approach to be
applied to new topics, such as control systems and event detection.
Furthermore, we elect to focus on the $\ell_2$-norm because of its
intuitive network-based rationale in that it leads to node stratification.

The network lasso is also related to probabilistic graphical
models (PGMs). Problem (2) is a type of Bayesian inference where we
learn a set of models or dependencies based on latent clustering.
The network lasso penalty, a form of regularization, allows for one
type of "relationship" between nodes, a weighted prior belief that
the connected variables should be equal. The clustering that our
model accomplishes is similar to a latent variable mixture model
[20], where cluster membership is indicated by some latent
variable. With this, certain network lasso problems can be rewritten
as a maximum likelihood estimation problem where a conditional
distribution is learned for each cluster. However, many examples
are difficult to encode and scale in this way. Additionally, there
has been much research on optimal decomposition and splitting
methods for these types of problems [19]. Hinge-loss Markov
random fields, which are PGMs defined over continuous variables
for MAP inference, use a similar ADMM-based approach to ours
[1], though the hinge-loss potentials they support do not include the
norm-based lasso that we utilize to induce the clustering. However,
unlike many of these other frameworks [1, 16], which often use
a probabilistic approach, we formulate it as a single, very large,
convex optimization problem that we solve by splitting it across
a graph. This focus on the specific topic of simultaneous
clustering and optimization enables us to provide a clean formalism and
scalable approach, with guaranteed convergence, for solving a wide
class of problems, all using the exact same algorithm.

2. CONVEX PROBLEM DEFINITION 

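We now look more closely at the network lasso problem,

$$\text{minimize} \quad \sum_{i \in \mathcal{V}} f_i(x_i) + \lambda \sum_{(j,k) \in \mathcal{E}} w_{jk} \|x_j - x_k\|_2.$$

This problem is convex in the variable $x = (x_1, \ldots, x_m) \in \mathbb{R}^{mp}$,
and we let $x^\star$ denote an optimal solution.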

Local Variables. It is worth noting that there can be local private
optimization variables at each node that are not part of the lasso
penalty. More formally, the network lasso problem can be defined as

$$\text{minimize} \quad \sum_{i \in \mathcal{V}} f_i(x_i, \xi_i) + \lambda \sum_{(j,k) \in \mathcal{E}} w_{jk} \|x_j - x_k\|_2, \tag{3}$$

where $\xi_i$ are potential dummy variables at node $i$ (the size can vary
at each node). However, using partial minimization, if we let

$$f_i(x_i) = \min_{\xi_i} f_i(x_i, \xi_i),$$

we get the original problem, defined in (2). For simplicity, we
therefore use problem (2) throughout the paper, with the implicit
understanding that there may be private variables at each node.

Regularization Path. Although the regularization parameter $\lambda$ in
problem (2) can be incorporated into the $w_{ij}$'s by scaling the edge
weights, it is best viewed separately as a single parameter which
is tuned to yield different global results. $\lambda$ defines a trade-off for
each node between minimizing its own objective and agreeing with
its neighbors. At $\lambda = 0$, $x_i^\star$, the solution at node $i$, is simply a
minimizer of $f_i$. This can be computed locally at each node, since
when $\lambda = 0$ the edges of the network have no effect. At the other
extreme, as $\lambda \to \infty$, problem (2) turns into

$$\text{minimize} \quad \sum_{i \in \mathcal{V}} f_i(\tilde{x}), \tag{4}$$

since a common $\tilde{x}$ must be the solution at every node. This is solved
by a single vector in $\mathbb{R}^p$. We refer to (4) as the consensus problem and
to its solution as the consensus solution. If a solution to (4) exists, it
can be shown that there is a finite $\lambda_{\text{critical}}$ such that for any $\lambda \geq \lambda_{\text{critical}}$,
the consensus solution holds. That is, beyond this $\lambda_{\text{critical}}$,
increasing $\lambda$ has no effect on the solution. For $\lambda$'s in between $\lambda = 0$
and $\lambda_{\text{critical}}$, the family of solutions is known as the regularization
path, though it is sometimes referred to as the clusterpath.

Network Lasso and Clustering. The $\ell_2$-norm penalty over the
edge difference, $\|x_j - x_k\|_2$, defines the network lasso. It
incentivizes the differences between connected nodes to be exactly zero,
rather than just close to zero, yet it does not penalize large outliers
(in this case, node values being very different) too severely. An
edge difference of zero means that $x_j = x_k$. When many edges
are in consensus like this, we have grouped the nodes into sets with
equal values of $x_i$. Each set of nodes, or cluster, has a common
solution for the variable $x_i$. The outliers then refer to edges between
nodes in different clusters. Cluster size tends to get larger as $\lambda$
increases, until at $\lambda_{\text{critical}}$ the consensus solution can be thought of as
a single cluster for the entire network. Even though increasing $\lambda$ is
most often agglomerative, cluster fission may occur, meaning two
nodes in the same cluster may break apart at a higher $\lambda$. Therefore,
the clustering pattern is not strictly hierarchical [22].

Inference on New Nodes. After we have solved for $x^\star$, we can
interpolate the solution to estimate the value of $x_j$ on a new node
$j$, for example during cross-validation on a test set. Given $j$, all
we need is its location within the network; that is, the neighbors
of $j$ and the edge weights. With this information, we treat $j$ like
a dummy node, with $f_j(x_j) = 0$. We solve for $x_j$ just like in
problem (2) except without the objective function $f_j$, so the
optimization problem becomes

$$\text{minimize} \quad \sum_{k \in N(j)} w_{jk} \|x_j - x_k^\star\|_2, \tag{5}$$

where $N(j)$ is the set of neighbors of node $j$. This estimate of $x_j$
can be thought of as a weighted median of $j$'s neighbors' solutions.
This is called the Weber problem, and it involves finding the point
which minimizes the weighted sum of distances to a set of other
points. It has no analytical solution when $j$ has more than two
neighbors, but it can be readily computed even for large problems.
For example, when one of the dimensions is much larger than the
other (number of neighbors vs. size of each $x_k$), the problem can
be solved in linear time with respect to the larger dimension.
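As an illustration of how the estimate for a new node could be computed, the sketch below uses Weiszfeld's iteration, a standard fixed-point method for the weighted geometric median. This is a swapped-in textbook technique, not necessarily the method used in the paper, and the function name `weighted_geometric_median`, the iteration count, and the `eps` guard are all assumptions made for the example.

```python
import math

def weighted_geometric_median(points, weights, iters=200, eps=1e-12):
    """Approximately minimize sum_k w_k * ||x - p_k||_2 (Weiszfeld iteration).

    points: the neighbors' solutions, one list of floats per neighbor.
    weights: the corresponding non-negative edge weights.
    """
    dim = len(points[0])
    total_w = sum(weights)
    # Start from the weighted mean of the neighbors' solutions.
    x = [sum(w * p[c] for p, w in zip(points, weights)) / total_w
         for c in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p, w in zip(points, weights):
            dist = math.sqrt(sum((xc - pc) ** 2 for xc, pc in zip(x, p)))
            coef = w / max(dist, eps)  # guard against division by zero
            den += coef
            for c in range(dim):
                num[c] += coef * p[c]
        # Each step is a re-weighted average of the neighbors' solutions.
        x = [n / den for n in num]
    return x

corners = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]
center = weighted_geometric_median(corners, [1.0, 1.0, 1.0, 1.0])
dominated = weighted_geometric_median(corners, [10.0, 1.0, 1.0, 1.0])
```

With equal weights the estimate sits at the center of the square; when one edge weight exceeds the sum of the others, the estimate collapses onto that neighbor's solution, which matches the "weighted median" interpretation.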

3. PROPOSED SOLUTION 

On smaller graphs, the network lasso problem can be solved
using standard interior point methods. This paper focuses on large
problems, where solving everything at once is infeasible. This is
especially true when we solve for a span of $\lambda$'s across the entire
regularization path, since we will need to solve a separate problem
for each $\lambda$. A distributed solution is necessary so that
computational and storage limits do not constrain the scope of potential
applications. We propose an easy-to-implement algorithm based
on the Alternating Direction Method of Multipliers (ADMM)
[21], a well-established method for distributed convex optimization.
With ADMM, each individual component solves its own private
objective function, passes this solution to its neighbors, and repeats
the process until the entire network converges. There is no need for
global coordination except for iteration synchronization.

3.1 ADMM 

To solve via ADMM, we introduce a copy of $x_i$, called $z_{ij}$, at
each edge $ij$. Note that the same edge also has a $z_{ji}$, a copy of $x_j$.
We rewrite problem (2) as an equivalent problem,

$$\begin{array}{ll} \text{minimize} & \sum_{i \in \mathcal{V}} f_i(x_i) + \lambda \sum_{(j,k) \in \mathcal{E}} w_{jk} \|z_{jk} - z_{kj}\|_2 \\ \text{subject to} & x_i = z_{ij}, \quad i = 1, \ldots, m, \quad j \in N(i). \end{array}$$

We then derive its augmented Lagrangian, which gives us

$$L_\rho(x, z, u) = \sum_{i \in \mathcal{V}} f_i(x_i) + \sum_{(j,k) \in \mathcal{E}} \Big( \lambda w_{jk} \|z_{jk} - z_{kj}\|_2 - (\rho/2)\big(\|u_{jk}\|_2^2 + \|u_{kj}\|_2^2\big) + (\rho/2)\big(\|x_j - z_{jk} + u_{jk}\|_2^2 + \|x_k - z_{kj} + u_{kj}\|_2^2\big) \Big),$$

where $u$ is the scaled dual variable and $\rho > 0$ is the penalty
parameter [§3.1.1]. ADMM consists of the following steps, with $k$
denoting the iteration number:

$$x^{k+1} = \operatorname*{argmin}_x L_\rho(x, z^k, u^k)$$
$$z^{k+1} = \operatorname*{argmin}_z L_\rho(x^{k+1}, z, u^k)$$
$$u^{k+1} = u^k + (x^{k+1} - z^{k+1}).$$

Let us examine each of these steps in more detail.

$x$-Update. In the $x$-update we minimize a separable sum of
functions, one per node, so it can be calculated independently at each
node and solved in parallel. At node $i$, this is

$$x_i^{k+1} = \operatorname*{argmin}_{x_i} \Big( f_i(x_i) + \sum_{j \in N(i)} (\rho/2) \|x_i - z_{ij}^k + u_{ij}^k\|_2^2 \Big).$$

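$z$-Update. The $z$-update is separable across the edges. Note that
for edge $ij$, we need to jointly update $z_{ij}$ and $z_{ji}$. This becomes

$$z_{ij}^{k+1}, z_{ji}^{k+1} = \operatorname*{argmin}_{z_{ij}, z_{ji}} \; \lambda w_{ij} \|z_{ij} - z_{ji}\|_2 + (\rho/2)\big(\|x_i^{k+1} - z_{ij} + u_{ij}^k\|_2^2 + \|x_j^{k+1} - z_{ji} + u_{ji}^k\|_2^2\big).$$

This problem has a closed-form analytical solution, which we
derive in Appendix A. It is

$$z_{ij}^\star = \theta (x_i + u_{ij}) + (1 - \theta)(x_j + u_{ji})$$
$$z_{ji}^\star = (1 - \theta)(x_i + u_{ij}) + \theta (x_j + u_{ji}),$$

where

$$\theta = \max\left( 1 - \frac{\lambda w_{ij}}{\rho \|x_i + u_{ij} - (x_j + u_{ji})\|_2}, \; 0.5 \right). \tag{6}$$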
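A minimal sketch of the closed-form edge update follows. It assumes the mixing weight is $\theta = \max(1 - \lambda w_{ij}/(\rho d),\, 0.5)$ with $d = \|x_i + u_{ij} - (x_j + u_{ji})\|_2$, which is what minimizing the edge objective along the segment between the two shifted points gives. The function name `z_update` and the handling of the degenerate case $d = 0$ are assumptions for the example.

```python
import math

def z_update(x_i, x_j, u_ij, u_ji, lam, w_ij, rho):
    """Jointly update the two copies (z_ij, z_ji) on a single edge."""
    a = [x + u for x, u in zip(x_i, u_ij)]
    b = [x + u for x, u in zip(x_j, u_ji)]
    d = math.sqrt(sum((ac - bc) ** 2 for ac, bc in zip(a, b)))
    if d == 0.0:
        theta = 0.5  # the two points coincide, so any mixing gives the same z
    else:
        theta = max(1.0 - lam * w_ij / (rho * d), 0.5)
    z_ij = [theta * ac + (1.0 - theta) * bc for ac, bc in zip(a, b)]
    z_ji = [(1.0 - theta) * ac + theta * bc for ac, bc in zip(a, b)]
    return z_ij, z_ji, theta

zero = [0.0, 0.0]
# Small penalty: the copies move toward each other but stay distinct.
z_ij, z_ji, theta = z_update([0.0, 0.0], [3.0, 4.0], zero, zero,
                             lam=1.0, w_ij=1.0, rho=1.0)
# Large penalty: theta is clipped at 0.5 and both copies meet at the midpoint.
m_ij, m_ji, m_theta = z_update([0.0, 0.0], [3.0, 4.0], zero, zero,
                               lam=10.0, w_ij=1.0, rho=1.0)
```

The clip at 0.5 is what produces exact agreement across an edge: once $\lambda w_{ij} \geq \rho d / 2$, both copies land on the same point.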


$u$-Update. The $u$-update is also edge-separable. For each variable,
this looks like

$$u_{ij}^{k+1} = u_{ij}^k + (x_i^{k+1} - z_{ij}^{k+1}).$$

Global Convergence. Because the problem is convex, ADMM is
guaranteed to converge to a global optimum. The stopping criterion
can be based on the primal and dual residuals, commonly defined
as $r$ and $s$, being below given threshold values. This allows
us to stop when $x^k$ and $z^k$ are close, and when $x^k$ (or $z^k$) does not
change much in one iteration. As is typical for ADMM, the
algorithm tends to attain modest accuracy relatively quickly, and high
accuracy (which in many applications is not needed) only slowly.
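The three updates can be combined into a short single-machine loop. The sketch below is an illustration under stated assumptions rather than the paper's implementation: it takes the node cost to be $f_i(x_i) = \frac{1}{2}\|x_i - a_i\|_2^2$, for which the $x$-update has the closed form $x_i = (a_i + \rho \sum_j (z_{ij} - u_{ij}))/(1 + \rho\,|N(i)|)$, it assumes a simple graph (at most one edge per node pair), and it runs a fixed number of iterations instead of checking residuals. The name `admm_network_lasso` is hypothetical.

```python
import math

def admm_network_lasso(a, edges, lam, rho=1.0, iters=500):
    """ADMM sketch for f_i(x_i) = 0.5 * ||x_i - a_i||_2^2 (an assumed node cost).

    a: list of node observations (lists of floats); edges: (i, j, w_ij) triples.
    Returns the node variables x after a fixed number of iterations.
    """
    p = len(a[0])
    x = [list(ai) for ai in a]
    # One (z, u) pair per directed copy: key (i, j) is node i's copy on edge ij.
    z, u = {}, {}
    neighbors = {i: [] for i in range(len(a))}
    for i, j, _ in edges:
        z[(i, j)], z[(j, i)] = list(a[i]), list(a[j])
        u[(i, j)], u[(j, i)] = [0.0] * p, [0.0] * p
        neighbors[i].append(j)
        neighbors[j].append(i)

    for _ in range(iters):
        # x-update: closed form for the quadratic node cost assumed above.
        for i, nbrs in neighbors.items():
            for c in range(p):
                pull = sum(z[(i, j)][c] - u[(i, j)][c] for j in nbrs)
                x[i][c] = (a[i][c] + rho * pull) / (1.0 + rho * len(nbrs))
        # z-update: move the two copies on each edge toward each other.
        for i, j, w in edges:
            s = [x[i][c] + u[(i, j)][c] for c in range(p)]
            t = [x[j][c] + u[(j, i)][c] for c in range(p)]
            d = math.sqrt(sum((sc - tc) ** 2 for sc, tc in zip(s, t)))
            theta = 0.5 if d == 0.0 else max(1.0 - lam * w / (rho * d), 0.5)
            z[(i, j)] = [theta * sc + (1 - theta) * tc for sc, tc in zip(s, t)]
            z[(j, i)] = [(1 - theta) * sc + theta * tc for sc, tc in zip(s, t)]
        # u-update: accumulate the residual x - z in the scaled dual variable.
        for (i, j) in z:
            u[(i, j)] = [u[(i, j)][c] + x[i][c] - z[(i, j)][c] for c in range(p)]
    return x

obs = [[0.0], [1.0], [5.0]]
path_edges = [(0, 1, 1.0), (1, 2, 1.0)]
local = admm_network_lasso(obs, path_edges, lam=0.0, iters=2000)
consensus = admm_network_lasso(obs, path_edges, lam=100.0, iters=2000)
```

At $\lambda = 0$ each node keeps its own observation, and for large $\lambda$ all nodes agree on the average of the observations, which is the consensus solution for this particular choice of $f_i$. A distributed version would run the per-node and per-edge loops in parallel.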


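Algorithm 1 ADMM Steps

repeat
    $x_i^{k+1} = \operatorname*{argmin}_{x_i} \Big( f_i(x_i) + \sum_{j \in N(i)} (\rho/2) \|x_i - z_{ij}^k + u_{ij}^k\|_2^2 \Big)$
    $z_{ij}^{k+1} = \theta (x_i + u_{ij}) + (1 - \theta)(x_j + u_{ji})$
    $z_{ji}^{k+1} = (1 - \theta)(x_i + u_{ij}) + \theta (x_j + u_{ji})$
    $u_{ij}^{k+1} = u_{ij}^k + (x_i^{k+1} - z_{ij}^{k+1})$
until $\|r^k\|_2 \leq \epsilon^{\text{pri}}$, $\|s^k\|_2 \leq \epsilon^{\text{dual}}$.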



3.2 Regularization Path 

It is often useful to compute the regularization path as a function
of $\lambda$ to gain insight into the network structure. For specific
applications, this may also help decide the correct value of $\lambda$ to use, for
example by choosing $\lambda$ to minimize the cross-validation error.

We begin the regularization path at $\lambda = 0$ and solve for an
increasing sequence of $\lambda$'s ($\lambda := \alpha\lambda$, $\alpha > 1$). We know when
we have reached $\lambda_{\text{critical}}$ because a single consensus solution will be
optimal at every node, and increasing $\lambda$ no longer affects the
solution. This may lead to a stopping point slightly above the
actual $\lambda_{\text{critical}}$, which we denote as $\hat{\lambda}_{\text{critical}}$. There is no harm if
$\hat{\lambda}_{\text{critical}} > \lambda_{\text{critical}}$, since they will both yield the same result, the
consensus solution. To account for the case where no consensus
solution exists, we can also stop when the new solution has changed
by less than some $\epsilon$, since even without consensus, the problem
converges to some solution.

A big advantage of the regularization path, as opposed to
computing each value of $x^\star(\lambda)$ in parallel, is that we begin with a warm
start towards the new solution at each step. For each $\lambda$, the
unknown variables are already close to the new $x^\star$, $u^\star$, and $z^\star$ by
virtue of starting at the solution for the last $\lambda$. In fact, when $f_i$ is
strictly convex, the solution $x^\star$ is continuous in $\lambda$. Without any
prior knowledge, for example initializing everything to 0 for each
$\lambda$, we start far from the actual solution, so it will often (although
not always) take longer to converge via ADMM. The only other
required variable is $\lambda_{\text{initial}}$, the initial non-zero value of $\lambda$, which
depends on the variable scaling. The hope is that $x^\star$ does not change
too much between $\lambda = 0$ and this initial value, and a rough estimate
of $\lambda_{\text{initial}}$ can be found using the following heuristic:

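1. Pick edge $ij$ at random and find $x_i^\star$, $x_j^\star$ at $\lambda = 0$.

2. Evaluate the gradients of $f_i(x)$ and $f_j(x)$ at $\bar{x} = (x_i^\star + x_j^\star)/2$.

3. Set $\lambda_{\text{initial}} := 0.01 \, \dfrac{\|\nabla f_i(\bar{x})\|_2 + \|\nabla f_j(\bar{x})\|_2}{2 w_{ij}}$.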

To get a more robust estimate, repeat the above steps picking
different edges each time, and choose the smallest solution for $\lambda_{\text{initial}}$.
Given these variables, we are now able to solve for the entire
regularization path. Our method is outlined in Algorithm 2.

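Algorithm 2 Regularization Path

initialize Solve for $x^\star$, $u^\star$, $z^\star$ at $\lambda = 0$.
set $\lambda := \lambda_{\text{initial}}$; $\alpha > 1$; $u := u^\star$; $z := z^\star$.
repeat
    Use ADMM to solve for $x^\star(\lambda)$ (see Algorithm 1)
    Stopping criterion: quit if $x^\star(\lambda) = x^\star(\lambda_{\text{previous}})$
    Set $\lambda := \alpha\lambda$.
return $x^\star(\lambda)$ for $\lambda$ from 0 to $\lambda_{\text{critical}}$.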
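The control flow of the regularization path can be sketched independently of the solver. In the example below, `solve(lam, warm_start)` is a hypothetical interface that returns the solution together with whatever state the solver wants to reuse (for ADMM, the $z$ and $u$ variables); the exact-equality stopping test is relaxed to a small tolerance `tol`. To keep the example self-contained, the demo solver is the closed-form solution for two scalar nodes with $f_i(x) = \frac{1}{2}(x - a_i)^2$, which is an illustrative stand-in and has no state to warm-start.

```python
def regularization_path(solve, lam_initial, alpha=1.5, tol=1e-8, max_steps=100):
    """Sweep lam := alpha * lam, warm-starting each solve from the last one.

    solve(lam, warm_start) -> (x, state) is a hypothetical solver interface;
    `state` carries whatever the solver wants to reuse between values of lam.
    """
    x, state = solve(0.0, None)
    path = [(0.0, x)]
    lam = lam_initial
    for _ in range(max_steps):
        x_new, state = solve(lam, state)
        path.append((lam, x_new))
        change = max(abs(p - q) for p, q in zip(x_new, x))
        x = x_new
        if change <= tol:  # the solution stopped moving between steps
            break
        lam *= alpha
    return path

def two_node_solver(a1, a2):
    """Closed-form stand-in solver for two scalar nodes and one unit edge."""
    def solve(lam, warm_start):
        mean, delta = (a1 + a2) / 2.0, a1 - a2
        mag = max(abs(delta) - 2.0 * lam, 0.0)
        gap = mag if delta >= 0 else -mag
        # Closed form, so there is no solver state to hand to the next step.
        return [mean + gap / 2.0, mean - gap / 2.0], None
    return solve

path = regularization_path(two_node_solver(0.0, 2.0), lam_initial=0.1)
```

For this toy problem consensus is reached at $\lambda = 1$, so the sweep overshoots slightly and stops at the first step where the solution no longer changes, mirroring the stopping rule described above.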


4. NON-CONVEX EXTENSION 

In many applications, we are using the group lasso as an
approximation of the $\ell_0$-norm. That is, we are looking for a sparse
solution where relatively few edge differences are non-zero.
However, once $\|x_i - x_j\|_2$ becomes non-zero, we do not care about its
magnitude, since we already know that $i$ and $j$ are in different
clusters. The lasso has a proportional penalty, which is the closest that
a convex function can come to approximating the $\ell_0$-norm. Once
we have found the true clusters, though, this will "pull" the
different clusters towards each other through their mutual edges. If we
replace the group lasso penalty with a monotonically
nondecreasing concave function $\phi(u)$, where $\phi(0) = 0$ and whose domain
is $u \geq 0$, we come even closer to the $\ell_0$-norm, as shown in Figure 1.
However, this new optimization problem,

$$\text{minimize} \quad \sum_{i \in \mathcal{V}} f_i(x_i) + \lambda \sum_{(j,k) \in \mathcal{E}} w_{jk} \, \phi\big(\|x_j - x_k\|_2\big), \tag{7}$$

is not convex. ADMM is not guaranteed to converge, and even if
it does, it need not be to a global optimum. It is in some sense a
"riskier" approach. In fact, different initial conditions on $x$, $u$, $z$,
and $\rho$ can yield quite different solutions. However, as a heuristic,
a slight modification to ADMM empirically performs very well.
Since the algorithm might not converge, it is necessary to keep track
of the iteration which yields the minimum objective, and to return
that as the solution instead of the most recent step. The primal and
dual residuals are not guaranteed to go to 0, so we instead run our
algorithm for a set number of iterations for each $\lambda$.

Non-Convex $z$-Update. Compared to the convex case, the only
difference in the ADMM solution is the $z$-update, which is now

$$\text{minimize} \quad \lambda w_{ij} \, \phi\big(\|z_{ij} - z_{ji}\|_2\big) + (\rho/2)\big(\|x_i^{k+1} - z_{ij} + u_{ij}^k\|_2^2 + \|x_j^{k+1} - z_{ji} + u_{ji}^k\|_2^2\big). \tag{8}$$

For simplicity, we define

$$a = x_i^{k+1} + u_{ij}^k, \qquad b = x_j^{k+1} + u_{ji}^k, \qquad c = \lambda w_{ij}, \qquad d = \|a - b\|_2,$$

so problem (8) turns into

$$\text{minimize} \quad c \, \phi\big(\|z_{ij} - z_{ji}\|_2\big) + (\rho/2)\big(\|a - z_{ij}\|_2^2 + \|b - z_{ji}\|_2^2\big). \tag{9}$$

There are two possible cases for the solution to problem (9):
$z_{ij}^\star = z_{ji}^\star$ or $z_{ij}^\star \neq z_{ji}^\star$. When the two solutions are identical,
then $\phi(\|z_{ij} - z_{ji}\|_2) = \phi(0) = 0$, so the only terms remaining
are

$$(\rho/2)\big(\|a - z_{ij}\|_2^2 + \|b - z_{ji}\|_2^2\big).$$

Minimizing over the constraint that $z_{ij} = z_{ji}$ yields $z_{ij}^\star = z_{ji}^\star = (1/2)(a + b)$
and an objective of $(\rho/4)\|a - b\|_2^2$.

When the two solutions are not equal, $z_{ij}^\star$ and $z_{ji}^\star$ must lie on
the line segment between $a$ and $b$. If $z_{ij}^\star$ and/or $z_{ji}^\star$ are not on the
line segment, projecting them onto this segment is nonincreasing in
$\phi(\|z_{ij} - z_{ji}\|_2)$ and decreasing in $(\rho/2)(\|a - z_{ij}\|_2^2 + \|b - z_{ji}\|_2^2)$,
so the total objective function is guaranteed to decrease. Therefore,
we know that

$$z_{ij}^\star = \theta_1 a + (1 - \theta_1) b, \qquad \theta_1 \in [0, 1]$$
$$z_{ji}^\star = \theta_2 a + (1 - \theta_2) b, \qquad \theta_2 \in [0, 1]$$

and that

$$\|z_{ij}^\star - z_{ji}^\star\|_2 = \|a - b\|_2 \, |\theta_1 - \theta_2| = d \, |\theta_1 - \theta_2|.$$

Note that the solution for $z_{ij}^\star = z_{ji}^\star$ is just $\theta_1 = \theta_2 = 1/2$. We
also know that $\theta_1 \geq \theta_2$. If $\theta_1 < \theta_2$, we could swap $\theta_1$ and $\theta_2$
and $\phi(\|z_{ij} - z_{ji}\|_2)$ would remain constant, but the rest of the
objective, $(\rho/2)(\|a - z_{ij}\|_2^2 + \|b - z_{ji}\|_2^2)$, would decrease.
Therefore, we can rewrite the norm of the difference as

$$\|z_{ij}^\star - z_{ji}^\star\|_2 = d(\theta_1 - \theta_2),$$

and the objective becomes

$$c \, \phi\big(d(\theta_1 - \theta_2)\big) + (\rho d^2/2)\big((1 - \theta_1)^2 + \theta_2^2\big).$$

When $z_{ij} \neq z_{ji}$, we know that $\theta_1 > \theta_2$, and thus $d(\theta_1 - \theta_2) > 0$.
When $\phi$ is differentiable at $d(\theta_1 - \theta_2)$, we set the gradient to zero:

$$\frac{\partial}{\partial \theta_1} = c d \, \phi'\big(d(\theta_1 - \theta_2)\big) - \rho d^2 (1 - \theta_1) = 0$$
$$\frac{\partial}{\partial \theta_2} = -c d \, \phi'\big(d(\theta_1 - \theta_2)\big) + \rho d^2 \theta_2 = 0.$$

We see that

$$\rho d^2 (1 - \theta_1) = c d \, \phi'\big(d(\theta_1 - \theta_2)\big) = \rho d^2 \theta_2,$$

or

$$\theta_2 = 1 - \theta_1.$$

This puts the entire optimization problem in terms of one variable,
$\theta = \theta_2$. Since $\theta_1 + \theta_2 = 1$ and $\theta_1 \geq \theta_2$, we know that $\theta \leq 1/2$, so
the final problem becomes

$$\begin{array}{ll} \text{minimize} & c \, \phi\big(d(1 - 2\theta)\big) + \rho d^2 \theta^2 \\ \text{subject to} & 0 \leq \theta \leq 1/2. \end{array}$$

It is of course necessary to find all solutions to this problem, since
there may be several or none, and to compare the resulting objective
to $(\rho/4)\|a - b\|_2^2$, the value when $z_{ij}^\star = z_{ji}^\star$. Of these solutions, pick the $z$'s
which minimize the overall objective function.

Log Function. We will now look at the specific case where $\phi(u) = \log(1 + u/\epsilon)$,
where $\epsilon$ is a constant scaling factor. The objective
function in the final problem above turns into

$$\text{minimize} \quad c \log\Big(1 + \frac{d(1 - 2\theta)}{\epsilon}\Big) + \rho d^2 \theta^2.$$

Setting the derivative equal to zero, we get

$$-\frac{2cd}{d - 2d\theta + \epsilon} + 2\rho d^2 \theta = 0.$$

We simplify to

$$2\rho d^2 \theta^2 - \rho d (d + \epsilon)\theta + c = 0$$

and see that this is a simple quadratic equation in $\theta$, solved by

$$\theta = \frac{\rho(d + \epsilon) \pm \sqrt{\rho^2 (d + \epsilon)^2 - 8\rho c}}{4\rho d}.$$

The $z$-update then involves comparing the resulting objectives with
$(\rho/4)\|a - b\|_2^2$ (the value when $z_{ij}^\star = z_{ji}^\star$) and then choosing the $\theta$
which yields the best of the three objectives to obtain $z_{ij}^\star$, $z_{ji}^\star$. If the
quadratic has no real roots, which happens more frequently as
$\lambda$ increases, we set $\theta = 1/2$, meaning the edge is in consensus. It
is worth reiterating that this method is not guaranteed to reach the
global optimum. Instead, it is an easy-to-implement algorithm that
parallels ADMM from the convex case.
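The log-penalty edge update can be sketched as follows. The code assumes $\phi(u) = \log(1 + u/\epsilon)$ and the parametrization $z_{ij} = (1 - \theta) a + \theta b$, $z_{ji} = \theta a + (1 - \theta) b$ with $0 \leq \theta \leq 1/2$, so that $\theta = 1/2$ is the consensus case. It solves the quadratic $2\rho d^2\theta^2 - \rho d(d + \epsilon)\theta + c = 0$ and compares the candidates. Discarding roots outside $[0, 1/2]$ is an assumption made here to respect the constraint, and the function name `log_penalty_z_update` is hypothetical.

```python
import math

def log_penalty_z_update(a, b, c, rho, eps):
    """z-update for phi(u) = log(1 + u/eps); a, b are vectors, c = lam * w_ij."""
    d = math.sqrt(sum((ac - bc) ** 2 for ac, bc in zip(a, b)))

    def objective(theta):
        # c * phi(d * (1 - 2 theta)) + rho * d^2 * theta^2
        return (c * math.log(1.0 + d * (1.0 - 2.0 * theta) / eps)
                + rho * d * d * theta * theta)

    candidates = [0.5]  # consensus case, with objective (rho / 4) * d^2
    if d > 0.0:
        disc = rho * rho * (d + eps) ** 2 - 8.0 * rho * c
        if disc >= 0.0:
            for sign in (1.0, -1.0):
                theta = (rho * (d + eps) + sign * math.sqrt(disc)) / (4.0 * rho * d)
                if 0.0 <= theta <= 0.5:  # keep only roots in the feasible interval
                    candidates.append(theta)
    theta = min(candidates, key=objective)
    z_ij = [(1.0 - theta) * ac + theta * bc for ac, bc in zip(a, b)]
    z_ji = [theta * ac + (1.0 - theta) * bc for ac, bc in zip(a, b)]
    return z_ij, z_ji, theta

# Weak edge penalty: the copies stay apart, close to a and b.
s_ij, s_ji, s_theta = log_penalty_z_update([2.0, 0.0], [0.0, 0.0],
                                           c=0.05, rho=1.0, eps=0.1)
# Strong edge penalty: no real roots, so the edge is put in consensus.
t_ij, t_ji, t_theta = log_penalty_z_update([2.0, 0.0], [0.0, 0.0],
                                           c=10.0, rho=1.0, eps=0.1)
```

Because the objective is not convex in $\theta$, one of the two roots can be a local maximum; taking the minimum over all candidates, including $\theta = 1/2$, avoids selecting it.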


5. EXPERIMENTS 

We now apply our approach on three examples to illustrate the
diverse set of problems that fall under the network lasso framework,
and to provide a simple and unified view of these seemingly
different applications. First, we look at a synthetic example in which
we gather statistical power from the network to improve
classification accuracy. Next, we see how our approach can apply to
a geographic network, allowing us to gain insights on residential
neighborhoods by predicting housing prices. Finally, we look at a
time series dataset for the purpose of detecting outliers, or
anomalous events, in the temporal data. To run these experiments, we
built a module combining Snap.py and CVXPY. The
network is stored as a Snap.py structure, and the $x$-updates of ADMM
are run in parallel using CVXPY. Even though this algorithm is
capable of being distributed across many machines, we instead
distribute it across multiple cores of a single machine for our
prototype. Our network-based convex optimization solver is available
at http://snap.stanford.edu/snapvx, and the code for
this paper can be found on the SnapVX website.

5.1 Network-Enhanced Classification 

We first analyze a synthetic network in which each node has a
support vector machine (SVM) classifier, but does not have
enough training data to accurately estimate it. The clustering of
the nodes in the network occurs because some of the nodes have
common underlying SVMs. The hope is that nodes can, in essence,
"borrow" training examples from their relevant neighbors to
improve their own results. Of course, neighbors with different
underlying models will provide misleading information to each other.
These are the edges whose lasso penalties should be non-zero,
yielding different solutions at the two connected nodes.

Dataset. We randomly generate a dataset containing 1000 nodes,
each with its own classifier, a support vector machine in $\mathbb{R}^{50}$. Given
an input $w \in \mathbb{R}^{50}$, each node tries to predict $y \in \{-1, 1\}$, where

$$y = \operatorname{sgn}(a_i^T w + a_{i,0} + v),$$

and $v \sim \mathcal{N}(0, 1)$, the noise, is independent for each data point.
An SVM involves solving a convex optimization problem from a
set of training examples to obtain $x_i = [a_i^T \; a_{i,0}]^T \in \mathbb{R}^{51}$. This
defines a separating hyperplane to determine how to classify new
inputs. There is no way to counter the noise $v$, but an accurate $x_i$
can help us predict $y$ from $w$ reasonably accurately. Each node
determines its own optimal classifier from a training set consisting
of 25 $(w, y)$-pairs per node, which are used to solve for $x$. All
elements in $w$, $a$, and $v$ are drawn independently from a normal
distribution, with the $y$ values dependent on the other variables.

Network. The 1000 nodes are split into 20 equally-sized groups.
Each group has a common underlying classifier, $[a^T \; a_0]^T$, while
different groups have independent $a$'s. If $i$ and $j$ are in the same
group, they have an edge with probability 0.5, and if they are in
different groups, there is an edge with probability 0.01. Overall,
this leads to a total of 17079 edges, with 28.12% of the edges
connecting nodes in different underlying groups. Even though this is
a synthetic example, there are a large number of misleading edges,
and each node has only 25 examples to train an SVM in $\mathbb{R}^{50}$, so
solving this problem is far from an easy task.
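The reported edge counts are consistent with the stated probabilities. With 20 groups of 50 nodes there are $20 \cdot \binom{50}{2} = 24{,}500$ within-group pairs and $\binom{1000}{2} - 24{,}500 = 475{,}000$ cross-group pairs, so the expected number of edges is $0.5 \cdot 24{,}500 + 0.01 \cdot 475{,}000 = 17{,}000$, of which about 27.9% cross groups. The sketch below generates such a graph with Python's standard `random` module; the function name, the consecutive-block group assignment, and the seed are assumptions for illustration, and a different seed will give slightly different counts than the paper's.

```python
import random

def planted_partition_edges(n_nodes=1000, n_groups=20, p_in=0.5, p_out=0.01, seed=0):
    """Sample an undirected graph with dense groups and sparse cross-group edges."""
    rng = random.Random(seed)
    # Consecutive blocks of nodes form a group (50 nodes per group by default).
    group = [i * n_groups // n_nodes for i in range(n_nodes)]
    edges = []
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            p = p_in if group[i] == group[j] else p_out
            if rng.random() < p:
                edges.append((i, j))
    return group, edges

group, edges = planted_partition_edges()
n_cross = sum(1 for i, j in edges if group[i] != group[j])
cross_fraction = n_cross / len(edges)
```

The cross-group edges are the "misleading" ones: they connect nodes whose underlying classifiers differ, so a good solution should leave their penalties non-zero.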

Optimization Parameter and Objective Function. At node $i$, the
optimization parameter $x_i = [x_{i,a}^T \; x_{i,0}]^T = [a_i^T \; a_{i,0}]^T$
defines our estimate for the separating hyperplane for the SVM.
The node then solves its own optimization problem, using its 25
training examples. At each node, $f_i$ is defined as

$$\begin{array}{ll} \text{minimize} & \frac{1}{2}\|x_{i,a}\|_2^2 + \sum_{i=1}^{25} c \, \|\varepsilon_i\|_1 \\ \text{subject to} & y^{(i)}\big(w^{(i)T} x_{i,a} + x_{i,0}\big) \geq 1 - \varepsilon_i, \quad i = 1, \ldots, 25. \end{array}$$

The $\varepsilon_i$'s are (local) slack variables. They allow points to be
misclassified in the training set of a soft margin SVM. We set $c$,
the threshold parameter, to a constant which was empirically found
to perform well on a common model. We solve for 51 + 25 = 76
variables at each node, so the total problem has 76,000 unknowns.
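Because the slack variables are private to each node, they can be eliminated by partial minimization: for a fixed hyperplane, the cheapest slack satisfying the margin constraint on a training example is $\max(0,\, 1 - y(w^T x_a + x_0))$, the hinge loss. The sketch below evaluates the resulting node cost in plain Python. It is an illustration of that elimination step, not the CVXPY formulation used in the experiments, and the function name `svm_node_cost` and the argument layout are assumptions.

```python
def svm_node_cost(x, samples, c):
    """Node cost f_i(x) after minimizing over the slack variables.

    x: list [x_a..., x_0] holding the hyperplane normal followed by the offset.
    samples: list of (w, y) training pairs with y in {-1, +1}.
    For fixed x the best slack is max(0, 1 - y * (w . x_a + x_0)).
    """
    x_a, x_0 = x[:-1], x[-1]
    margin_term = 0.5 * sum(v * v for v in x_a)
    hinge = 0.0
    for w, y in samples:
        score = sum(wc * xc for wc, xc in zip(w, x_a)) + x_0
        hinge += max(0.0, 1.0 - y * score)
    return margin_term + c * hinge

samples = [([2.0, 0.0], 1),   # outside the margin: no slack needed
           ([0.5, 0.0], 1),   # inside the margin: slack 0.5
           ([1.0, 0.0], -1)]  # misclassified: slack 2.0
cost = svm_node_cost([1.0, 0.0, 0.0], samples, c=1.0)
```

Seen this way, each node carries only the 51 hyperplane parameters in the network lasso penalty, while the 25 slack variables stay local to the node.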

Results. To evaluate performance, we find prediction accuracy on a
separate test set of 10,000 examples (10 per node). In Figure 2, we
plot percentage of correct predictions vs. $\lambda$, where $\lambda$ is displayed in
log-scale, over the regularization path. Note that the two extremes
of the path represent important baselines.

At $\lambda = 0$, each node only uses its own training examples, ignoring
all the information provided by its neighbors. This is just a local
SVM, with only 25 training examples to estimate a 51-dimensional
vector. This leads to a prediction accuracy of 65.9% on the test
set. When $\lambda \geq \lambda_{\text{critical}}$, the problem finds a common $x$, which is
equivalent to solving a global SVM over the entire network. This
assumes the entire graph is coupled together and does not allow for
any edges to break. This common hyperplane at every node yields
an accuracy of 57.1%, which is barely an improvement over random
guessing. In contrast, both the convex and non-convex cases
perform much better for $\lambda$'s in the middle. From Figure 2, we see
a distinct shape in the regularization paths. As $\lambda$ increases, the
accuracy steadily improves, until a peak near $\lambda = 1$. Intuitively, this
represents the point where the algorithm has approximately split the
nodes into their correct clusters, each with its own classifier. As $\lambda$
continues to increase, there is a rapid drop off in performance, due
to the different clusters “pulling” each other together. The maximum
prediction accuracies on the test sets are 86.68% (convex) and
87.94% (non-convex). These prediction results are summarized in
Table 1.

Figure 2: SVM regularization path.

Method                                             Maximum Prediction Accuracy
Local SVM ($\lambda = 0$)                          65.90%
Global SVM ($\lambda \geq \lambda_{\text{critical}}$)   57.10%
Convex Network Lasso                               86.68%
Non-Convex Network Lasso                           87.94%

Table 1: SVM test set prediction accuracy.

Timing Results. We compare our network lasso algorithm to a
standard centralized method on a single 40-core CPU where the
entire problem fits into memory. For the centralized case, we used
the same solver (CVXPY) as in the x-updates for ADMM. While
wrapped in a Python layer, CVXPY’s underlying solver uses ECOS,
an open-source software package specifically designed for
high performance numerical optimization, so the Python overhead
is negligible when it comes to the cost of scaling to large problems.
We report results on the synthetic SVM example, scaling
the problem size over several orders of magnitude. We solve the
problem at 12 geometrically spaced values of $\lambda$ to span the entire
regularization path. The number of underlying SVM clusters scales
with $n$, the number of nodes. The entire regularization path is one large
problem (consisting of 12 smaller ones), and we measure its total
runtime. Note that each node in this case is solving its own SVM,
with additional coupling constraints due to the network lasso on the
edges. We vary the total number of nodes, and the results are shown
in Figure 3. We see that, in this example, the centralized method
scales on the order of problem size cubed, whereas ADMM takes
closer to linear time, until other concerns such as memory limitations
begin to factor in. By the time there are 20,000 unknowns,
ADMM is already 100 times faster, and this discrepancy in convergence
time only grows as the problem gets larger.

To further test our algorithm, we also solve a larger yet simpler
problem. We build a random 3-regular graph (every node has a degree
of 3) with 2000 nodes. The objective function at each node
is $f_i(x_i) = \|x_i - a_i\|_2^2$, where $a_i$ is a random vector in $\mathbb{R}^q$. We
can modify the value of $q$ to vary the total number of unknowns.
We pick a single (constant) $\lambda$ in the middle of the regularization
path and see how long it takes to solve the problem using ADMM.
The results are shown in Table 2. We can compute a solution for 1
million unknowns in under 20 seconds, and for 100 million in under
15 minutes. It is worth reiterating that at each step, at each node, we use
CVXPY rather than a more specialized solver for the x-update
subproblem. This allows the same solver to work on any convex node
objective, rather than being constrained to specific classes of
functions, and yet it is still able to scale to tens of millions of unknown
variables.

Figure 3: Convergence comparison between centralized and
ADMM methods for SVM problem (x-axis: number of unknowns).

Number of Unknowns     ADMM Solution Time (seconds)
100,000                12.20
1 million              18.16
10 million             128.98
100 million            822.62

Table 2: Convergence time for large-scale 3-regular graph
solved at a single (constant) value of $\lambda$.
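One rough way to read the timings in Table 2 is to compute the slope between consecutive rows on a log-log scale; a slope of 1 corresponds to linear growth in solution time. The sketch below does this arithmetic on the four reported measurements. The helper `loglog_slope` is a hypothetical name, and with only four data points the slopes are indicative at best.

```python
from math import log10

# (number of unknowns, ADMM solution time in seconds), copied from Table 2.
table2 = [(1e5, 12.20), (1e6, 18.16), (1e7, 128.98), (1e8, 822.62)]


def loglog_slope(p, q):
    """Slope of the line through two (size, time) points on log-log axes."""
    (n1, t1), (n2, t2) = p, q
    return (log10(t2) - log10(t1)) / (log10(n2) - log10(n1))


slopes = [loglog_slope(table2[k], table2[k + 1]) for k in range(len(table2) - 1)]
for (n1, _), (n2, _), s in zip(table2, table2[1:], slopes):
    print(f"{n1:.0e} -> {n2:.0e} unknowns: slope {s:.2f}")
```

In these measurements every slope stays below 1, which is consistent with the roughly linear scaling described for ADMM.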

5.2 Spatial Clustering with Regressors 

In this example, as described in the introduction, we attempt to
estimate the price of homes based on latitude/longitude data and a
set of features. Home prices often cluster together along neighborhood
lines. In this case, the clustering occurs when nearby houses
have similar pricing models, while edges that have non-zero edge
differences will be between those in different neighborhoods. As
houses are grouped together, each cluster builds its own local linear
regression model to predict prices in its region. Then, when there
is a new house, we can infer its regression model from the local
neighborhood to estimate the sales price.

Dataset. We look at a list of real estate transactions over a one-week
period in May 2008 in the Greater Sacramento area.¹ This
dataset contains information on 985 sales, including latitude,
longitude, number of bedrooms, number of bathrooms, square feet,
and sales price. However, as often happens with real data, we are
missing some of the values. 17% of the home sales are missing at
least one of the features; i.e., some of the bedroom/bathroom/size
data is not provided. The price and all attributes are standardized
to zero mean and unit variance, so any missing features are ignored
by setting the value to zero, the average. To verify our results, we
use a random subset of 200 houses as our test set.
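The standardization step, with missing entries replaced by the post-standardization average of zero, can be sketched as follows. This is an illustrative assumption about the preprocessing, not the authors' code: the name `standardize` is hypothetical, missing values are represented by `None`, and the population standard deviation of the observed values is used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale observed values to zero mean and unit variance.

    Missing entries (None) are mapped to 0.0, i.e. the mean after
    standardization, so they carry no information about the feature.
    Assumes at least two distinct observed values.
    """
    observed = [v for v in values if v is not None]
    m, s = mean(observed), pstdev(observed)
    return [0.0 if v is None else (v - m) / s for v in values]


# Toy usage: the middle entry is missing.
scaled = standardize([1.0, None, 3.0])
```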

Network. We build the graph by using the latitude/longitude
coordinates of each house. After removing the test set, we connect every
remaining house to the five nearest homes with an edge weight
inversely proportional to the distance between the houses. If house $j$
is in the set of nearest neighbors of $i$, there is an undirected edge
regardless of whether or not house $i$ is one of $j$’s nearest
neighbors. The resulting graph leaves 785 nodes, 2447 edges, and has a
diameter of 61.
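The graph construction can be sketched as a symmetrized k-nearest-neighbor graph. In the sketch below, the function name `knn_graph`, the use of plain Euclidean distance on the coordinate pairs, and the choice to skip coincident points (zero distance) are all assumptions made for illustration; the text does not specify these details.

```python
from math import hypot


def knn_graph(points, k=5):
    """Undirected k-NN graph with inverse-distance edge weights.

    points : list of (x, y) coordinates
    Returns a dict mapping (i, j) with i < j to the weight 1 / distance.
    An edge is kept if either endpoint lists the other among its k nearest.
    """
    edges = {}
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        for d, j in dists[:k]:
            if d > 0:  # assumption: coincident points get no edge
                edges[(min(i, j), max(i, j))] = 1.0 / d
    return edges


# Toy usage: four points on a line, one neighbor each.
toy_edges = knn_graph([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)], k=1)
```

Because the edge set is the union of each node's neighbor list, a node can end up with more than k edges, which is why the graph above has 2447 edges rather than at most 785 × 5 / 2.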

¹ Data available at http://support.spatialkey.com/spatialkey-sample-csv-data/




Figure 4: Regularization path for housing data. (a) Convex. (b) Non-Convex.


Method                                                                   Mean Squared Error (MSE)
Geographic ($\lambda = 0$)                                               0.6013
Regularized Linear Regression ($\lambda \geq \lambda_{\text{critical}}$)   0.8611
Naive Prediction (Global Mean)                                           1.0245
Convex Network Lasso                                                     0.4630
Non-Convex Network Lasso                                                 0.4539

Table 3: MSE for housing price predictions on test set.


Optimization Parameter and Objective Function. At each node,
we solve for

$$x_i = [a_i \;\; b_i \;\; c_i \;\; d_i]^T,$$

which gives us the weights of the regressors. The price estimate is
given by

$$\widehat{\text{price}}_i = a_i \cdot \text{Bedrooms} + b_i \cdot \text{Bathrooms} + c_i \cdot \text{SQFT} + d_i,$$

where the constant offset $d_i$ is the “baseline”. To prevent overfitting,
we regularize the $a_i$, $b_i$, and $c_i$ terms, everything besides the
offset. The objective function at each node then becomes

$$f_i = \|\text{price}_i - \widehat{\text{price}}_i\|_2^2 + \mu \|\tilde{x}_i\|_2^2,$$

where $\tilde{x}_i = [a_i \;\; b_i \;\; c_i]^T$, $\text{price}_i$ is the actual sales price, and $\mu$
is a constant regularization parameter.

To predict the prices on the test set, we connect each new house
to the 5 nearest homes, weighted by inverse distance, just like
before. We then infer the value of $x_j$ at node $j$ by solving the
inference problem for new nodes, and we use this value to estimate
the sales price.
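The price estimate and node objective translate almost directly into code. The sketch below is a minimal illustration with hypothetical function names (`price_estimate`, `node_objective`) and made-up standardized values; it evaluates the objective for a given parameter vector and does not solve the optimization problem.

```python
def price_estimate(x, bedrooms, bathrooms, sqft):
    """Linear price model: x = (a, b, c, d) with d the constant offset."""
    a, b, c, d = x
    return a * bedrooms + b * bathrooms + c * sqft + d


def node_objective(x, house, mu):
    """Squared prediction error plus a ridge penalty on (a, b, c) only.

    house : (bedrooms, bathrooms, sqft, price), all standardized
    mu    : regularization constant; the offset d is not penalized
    """
    bedrooms, bathrooms, sqft, price = house
    residual = price - price_estimate(x, bedrooms, bathrooms, sqft)
    a, b, c, _ = x
    return residual ** 2 + mu * (a * a + b * b + c * c)


# Toy usage with hypothetical standardized values.
toy_x = (0.5, 0.0, 1.0, 0.2)
toy_house = (1.0, 0.0, 0.5, 1.0)
toy_objective = node_objective(toy_x, toy_house, mu=0.1)
```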

Results. We plot the mean squared error (MSE) vs. $\lambda$ in Figure 4
for both the convex and non-convex formulations of the problem.
Once again, the two extremes of the regularization path are relevant
baselines.

At $\lambda = 0$, the regularization term in $f_i(x_i)$ ensures that the only
non-zero element of $x_i$ is $d_i$. This ignores the regressors and is a
prediction based solely on spatial data. Our estimate for each new
house is simply the weighted median price of the 5 nearest homes,
which leads to an MSE of 0.6013 on the test set. For large $\lambda$’s, we
are fitting a common model for all the houses. This is just regularized
linear regression on the entire dataset and is the canonical
method of estimating housing prices from a series of features. Note
that this approach completely ignores the geographic network. As
expected, it performs rather poorly, with an MSE of 0.8611. Since
the prices are standardized with unit variance, a naive guess (with
no information about the house) would just be the global average of
the training set, which has an MSE of 1.0245. The convex and
non-convex methods both perform best around $\lambda = 5$, with minimum
MSEs of 0.4630 and 0.4539, respectively.

We can visualize the clustering pattern by overlaying the network
on a map of Sacramento. We plot each sale with a marker,
colored according to its corresponding $x_i$ (so houses with similar
colors have similar models, and those with the same color are in
consensus). With this, we see how the clustering pattern emerges.
In Figure 5, we look at this plot for three values of $\lambda$. In 5(a), $\lambda$
is too small, so the neighborhoods have not yet formed. On the
other hand, in 5(b), $\lambda$ is too large. The clustering is clear, but it
performs poorly because it forces together neighborhoods which
are very different. Figure 5(c) is a viable choice of $\lambda$, leading to
low MSE while showing a clear partitioning of the network into
neighborhoods of different sizes.

Aside from outperforming the baselines, this method is also
well-suited to detect and handle anomalies. As shown in the plots,
outliers are often treated as single-element clusters, for example the
yellow house on the right side of 5(c). These houses are ones which
do not fit in with their local model (for a variety of possible
reasons), but using the network lasso, neither they nor their neighbors
are adversely affected too significantly by each other. Of course, as
$\lambda$ approaches $\lambda_{\text{critical}}$, these clusters are forced together into
consensus. However, near the optimal $\lambda$, we accurately classify these
anomalies, isolate them from the rest of the graph, and build
separate and relatively accurate models for both subsets.

[Figure 5: Regularization path clustering pattern. Panels (a), (b), and (c); panel (c) shows $\lambda = 10$.]

5.3 Event Detection in Time Series Data

Lastly, we aim to predict the existence of certain “events” in a
building, those which were officially listed by the building
coordinator. We are given the entry and exit data from the building over
a 15 week interval. For these events, we expect to see an anomalous
increase in traffic. Note that this is just a partial ground truth,
only containing events officially reported by the coordinator, and
many unreported events likely occurred during this interval.
Therefore, “false positives” are not necessarily incorrect, so the absolute
results (how accurately we predict the events) are not a perfect
indicator of performance. However, this provides a good benchmark,
especially when compared to a common baseline.

Dataset. The data comes from the main door of the Calit2 building
at UC Irvine. This count data, the number of entries and exits, is
reported once every 30 minutes over the course of 15 weeks from
July to November 2005, for a total of 5,040 readings.² Additionally,
we use a list of the 30 official events which occurred inside the
building during that interval.

Network. We build a linear network where node $i$, covering the
$i$th interval in the time series, has only two edges. These connect
it to nodes $i - 1$ and $i + 1$. The first and last nodes only have one
edge, leaving 5,040 nodes and 5,039 edges. There are more
complicated ways to model the coupling of time series data, but we opt
for simplicity since our goal is to show one approach, rather than
necessarily the optimal method, of solving this class of problems.

Optimization Parameter and Objective Function. Traffic is
periodic on a weekly basis. That is, a relatively similar number of
people enter and exit the building on, for example, Mondays from
1:00 - 1:30 PM. We do not care, for instance, that there is more traffic
at 1:00 PM than at 1:00 AM. This is not an indicator that an event
occurred at 1 PM. Instead, we care about the number of people
relative to the periodic signal. We let

$$
\hat{x}_i = \begin{bmatrix} \text{in}_i - \text{in}_{(i \bmod 336)} \\ \text{out}_i - \text{out}_{(i \bmod 336)} \end{bmatrix},
$$

where $\text{in}_{(i \bmod 336)}$ and $\text{out}_{(i \bmod 336)}$ are the median value of
entrances/exits for the given time and day of the week
($7 \cdot 24 \cdot 2 = 336$) over the 15 week interval. We use the median because the
mean can be skewed by the increases due to actual events.

The objective function is defined as

$$f_i(x_i) = \|x_i - \hat{x}_i\|_2^2 + \mu \|x_i\|_1.$$

The variable that we optimize over, $x_i$, is an attempt to match the
non-periodic signal at that time. The regularization term on $x_i$ is
a lasso penalty, so only a select few of the $x$’s will be non-zero.
These non-zero values refer to the times of the anomalous events
that we are trying to predict. It is worth noting that for any finite
network lasso parameter $\lambda$, there exists a $\mu$ large enough so that
every $x_i$ is guaranteed to be $[0, 0]^T$.

An event often manifests itself as a sustained period of increased
activity. Therefore, we declare an event on the interval $[t, t + k]$ if

$$x_{i,\text{in}} + x_{i,\text{out}} > 0, \quad i \in [t, t + k].$$

We vary $\mu$ to change the number of events predicted. For small $\mu$,
the slightest noise can be interpreted as an event. Large $\mu$’s lead
to fewer predictions, until eventually every $x_i$ is forced to 0, as
mentioned before. The parameter $\lambda$ determines the average event
length, as it encourages prolonged increases in activity and
discourages single outliers from being picked up. However, in this
example, the model is relatively robust to changes in $\lambda$ (up to a certain
point), so we keep it constant as we vary $\mu$, as a slight modification
of the regularization path from previous experiments.
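The two bookkeeping steps around the optimization, removing the weekly median signal and turning a fitted sequence into event intervals, can be sketched in a few lines. The function names `periodic_residual` and `declare_events` are hypothetical, and the sketch deliberately leaves out the network lasso solve itself: in the experiments, `declare_events` would be applied to the optimized $x_i$, not directly to the residuals.

```python
from statistics import median

SLOTS_PER_WEEK = 7 * 24 * 2  # 336 half-hour slots


def periodic_residual(counts, period=SLOTS_PER_WEEK):
    """Subtract the per-slot median from each (in, out) reading."""
    slots = {}
    for i, reading in enumerate(counts):
        slots.setdefault(i % period, []).append(reading)
    med = {
        s: (median(r[0] for r in vals), median(r[1] for r in vals))
        for s, vals in slots.items()
    }
    return [
        (n_in - med[i % period][0], n_out - med[i % period][1])
        for i, (n_in, n_out) in enumerate(counts)
    ]


def declare_events(x):
    """Maximal intervals (t, t_end) on which x_in + x_out > 0 throughout."""
    events, start = [], None
    for i, (x_in, x_out) in enumerate(x):
        active = x_in + x_out > 0
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(x) - 1))
    return events


# Toy usage with a period of 2 slots instead of 336.
toy_counts = [(1, 1), (5, 5), (1, 1), (5, 5), (9, 1), (5, 5)]
toy_residual = periodic_residual(toy_counts, period=2)
toy_events = declare_events(toy_residual)
```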

Baseline. This type of problem is often modeled as a Poisson
process, so we use that as our baseline method [15]. We consider each
time and day of the week as having an independent Poisson rate $\lambda$
(which is unrelated to the regularization parameter with the same
name in the network lasso). We set $\lambda$, the “expected” number of
count data, to the maximum likelihood estimate of a Poisson
process, the mean of the 15 values. $\lambda_{\text{in}}$ and $\lambda_{\text{out}}$ are calculated
independently. We define an event from $[t, t + k]$ if

$$
\frac{e^{-\lambda_{\text{in}}} \lambda_{\text{in}}^{N_{\text{in}}(i)}}{N_{\text{in}}(i)!} \cdot
\frac{e^{-\lambda_{\text{out}}} \lambda_{\text{out}}^{N_{\text{out}}(i)}}{N_{\text{out}}(i)!} < \epsilon, \quad i \in [t, t + k].
$$

This says that the given number of entries and exits at time $i$ occurs
with probability less than $\epsilon$. Since only large totals should trigger
a predicted event (rather than abnormally low entry/exit numbers),
one final requirement is that either $N_{\text{in}} > \lambda_{\text{in}}$ or $N_{\text{out}} > \lambda_{\text{out}}$ at
every time in the interval. Varying the threshold $\epsilon$, similar to $\mu$ for our
approach, changes the number of predicted events.

Number of Correct      Predicted Events
Events Detected        Convex    Non-Convex    Poisson
30                     146       201           264
29                     125       135           214
28                     116       121           201
27                     101       116           188
26                     97        114           131
24                     76        78            100
18                     56        64            62

Table 4: Number of required predictions to detect events.

² Data from https://archive.ics.uci.edu/ml/datasets/CalIt2+Building+People+Counts [18].
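The per-interval test used by the Poisson baseline can be sketched as below. The function names `poisson_pmf` and `is_anomalous` are hypothetical, the rates are assumed to be strictly positive, and the numbers in the usage lines are made up for illustration; an event would then be declared on any run of consecutive intervals that all pass this test.

```python
from math import exp, lgamma, log


def poisson_pmf(n, lam):
    """P(N = n) for a Poisson variable with rate lam > 0.

    Computed in log space so that large counts do not overflow.
    """
    return exp(-lam + n * log(lam) - lgamma(n + 1))


def is_anomalous(n_in, n_out, lam_in, lam_out, eps):
    """True if the joint probability is below eps and traffic is high."""
    prob = poisson_pmf(n_in, lam_in) * poisson_pmf(n_out, lam_out)
    high = n_in > lam_in or n_out > lam_out
    return prob < eps and high


# Toy usage with hypothetical rates of 5 entries and 5 exits per slot.
busy = is_anomalous(20, 20, 5.0, 5.0, eps=1e-6)     # unusually high counts
quiet = is_anomalous(0, 0, 5.0, 5.0, eps=1e-3)      # unlikely, but low traffic
typical = is_anomalous(5, 5, 5.0, 5.0, eps=1e-6)    # ordinary counts
```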

Results. For both our model and the baseline, we compute the
number of correct events vs. number of predicted events. We
define a correct prediction as one in which the prediction and the true
event overlap. The accuracy of all three approaches at several key
points is summarized in Table 4. As shown, both the convex and
non-convex methods outperform the Poisson baseline (though the
convex approach does noticeably better than the non-convex). The
Poisson is able to catch the “low-hanging fruit”, the easy-to-detect
events, with relatively good accuracy. The discrepancy arises in the
less obvious ones. Again, this is just a partial ground truth and it is
likely that there are many more than 30 events, but the poor
performance of the Poisson method (it takes 264 predictions to find all
30 events) suggests that it may be an imperfect method of event
detection. Note that more complicated models, specifically tuned
for outlier detection, may beat these results. For example, when an
event occurs, we expect to see a large spike in inbound traffic at the
beginning of the event, and a similar outbound one at the end. Our
approach could easily be modified in future work to account for
additional information such as this. However, as a simple model and
a proof of concept, these results are very encouraging.


6. CONCLUSION AND FUTURE WORK 

In this paper, we have shown that within one single framework, it
is possible to better understand and improve on many common
machine learning and network analysis problems. The network lasso
is a useful way of representing convex optimization problems, and
the magnitude of the improvements in the experiments shows that
this approach is worth exploring further, as there are many potential
ideas to build on. The non-convex method gave comparable
performance to the convex approach, and we leave for future work
the analysis of different non-convex functions. It is also possible
to look at the sensitivity of these results to the structure of the
network. For example, we could attempt to iteratively reweight the
edge weights to attain some desired outcome. Within the ADMM
algorithm, there are many ways to improve speed, performance,
and robustness. This includes finding closed-form solutions for
common objective functions $f_i(x_i)$, automatically determining the
optimal ADMM parameter $\rho$, and even allowing edge objective
functions $f_e(x_i, x_j)$ beyond just the weighted network lasso. As
this topic develops further, there is an opportunity for easy-to-use
software packages which allow programmers to solve these types
of large-scale optimization problems in a distributed setting without
having to specify the implementation details, which would greatly
improve the practical benefit of this work.
APPENDIX

A. ANALYTICAL SOLUTION TO z-UPDATE 

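We will show that the solution to

$$
\text{minimize} \quad \lambda w_{ij} \|z_{ij} - z_{ji}\|_2 + (\rho/2)\left(\|x_i^{k+1} - z_{ij} + u_{ij}^k\|_2^2 + \|x_j^{k+1} - z_{ji} + u_{ji}^k\|_2^2\right),
$$

with variables $z_{ij}$ and $z_{ji}$, is

$$
\begin{aligned}
z_{ij} &= \theta (x_i + u_{ij}) + (1 - \theta)(x_j + u_{ji}) \\
z_{ji} &= (1 - \theta)(x_i + u_{ij}) + \theta (x_j + u_{ji}),
\end{aligned}
$$

where $\theta$ is defined in equation (6).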

We first note that the objective is strictly convex, so the solution
is unique. As in §4, we let

$$a = x_i^{k+1} + u_{ij}^k, \quad b = x_j^{k+1} + u_{ji}^k, \quad c = \lambda w_{ij},$$

so the original problem turns into

$$
\text{minimize} \quad c \|z_{ij} - z_{ji}\|_2 + (\rho/2)\left(\|a - z_{ij}\|_2^2 + \|b - z_{ji}\|_2^2\right).
$$

There are two possible cases for the optimal values $z_{ij}^\star$ and $z_{ji}^\star$.

Case 1: $z_{ij}^\star = z_{ji}^\star$. If the two variables are equal, then
$\|z_{ij} - z_{ji}\|_2 = 0$, so the only terms remaining are

$$(\rho/2)\left(\|a - z_{ij}\|_2^2 + \|b - z_{ji}\|_2^2\right).$$

Minimizing over the constraint that $z_{ij} = z_{ji}$ yields $z_{ij}^\star = z_{ji}^\star = (1/2)(a + b)$,
with objective value $(\rho/4)\|a - b\|_2^2$.

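Case 2: $z_{ij}^\star \neq z_{ji}^\star$. When the two variables are not equal, the
objective is differentiable. In this case, the necessary and sufficient
condition for optimality is $\nabla f = 0$, or

$$
\nabla \left( c \|z_{ij} - z_{ji}\|_2 + (\rho/2)\|a - z_{ij}\|_2^2 + (\rho/2)\|b - z_{ji}\|_2^2 \right) = 0.
$$

The gradient can be written as

$$
\begin{bmatrix}
c \dfrac{z_{ij} - z_{ji}}{\|z_{ij} - z_{ji}\|_2} - \rho (a - z_{ij}) \\[2ex]
-c \dfrac{z_{ij} - z_{ji}}{\|z_{ij} - z_{ji}\|_2} - \rho (b - z_{ji})
\end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
$$

so the two equations that must be satisfied are

$$
c \frac{z_{ij} - z_{ji}}{\|z_{ij} - z_{ji}\|_2} - \rho (a - z_{ij}) = 0, \qquad
-c \frac{z_{ij} - z_{ji}}{\|z_{ij} - z_{ji}\|_2} - \rho (b - z_{ji}) = 0.
$$

Letting $\mu = \|z_{ij} - z_{ji}\|_2$, we get

$$
c (z_{ij} - z_{ji}) = \mu\rho (a - z_{ij}), \qquad -c (z_{ij} - z_{ji}) = \mu\rho (b - z_{ji}).
$$

Adding the two equations gives

$$z_{ij} + z_{ji} = a + b,$$

and subtracting them leads to

$$z_{ij} - z_{ji} = \frac{\mu\rho (a - b)}{2c + \mu\rho}.$$

Treating $\mu$ as a constant, this yields a system of linear equations for
$z_{ij}$ and $z_{ji}$, which we solve to obtain

$$z_{ij} = \theta a + (1 - \theta) b, \qquad z_{ji} = (1 - \theta) a + \theta b,$$

where

$$\theta = \frac{1}{2} + \frac{\mu\rho}{4c + 2\mu\rho}.$$

We know that $\mu = \|z_{ij} - z_{ji}\|_2$, so we plug in for $z_{ij}$ and $z_{ji}$,

$$\mu = \|z_{ij} - z_{ji}\|_2 = \left\| \frac{\mu\rho (a - b)}{2c + \mu\rho} \right\|_2,$$

which reduces to

$$1 = \frac{\rho}{2c + \mu\rho} \|a - b\|_2.$$

From this, we can solve for $\mu$,

$$\mu = \|a - b\|_2 - \frac{2c}{\rho}.$$

We plug in $\mu$ to solve for $\theta$, which yields

$$
\theta = \frac{1}{2} + \frac{\left(\|a - b\|_2 - \frac{2c}{\rho}\right)\rho}{4c + 2\rho\left(\|a - b\|_2 - \frac{2c}{\rho}\right)}.
$$

This is then reduced to

$$
\theta = \frac{1}{2} + \frac{\rho \|a - b\|_2 - 2c}{2\rho \|a - b\|_2} = 1 - \frac{c}{\rho \|a - b\|_2}.
$$

However, this only holds if $z_{ij} \neq z_{ji}$. When this condition is not
satisfied, we know the solution is case 1, which is equivalent to
$\theta = 1/2$. When it is satisfied, we need to compare the resulting objective
with $(\rho/4)\|a - b\|_2^2$, the value from case 1. Routine calculations show
that case 2 is preferred when $\theta > 1/2$. Therefore, combining these equations
and plugging in for $a$, $b$, and $c$, we arrive at our solution,

$$
\theta = \max\left( 1 - \frac{\lambda w_{ij}}{\rho \|x_i + u_{ij} - (x_j + u_{ji})\|_2},\; 0.5 \right).
$$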
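The closed-form z-update derived above is short enough to sketch directly. The function name `z_update` is hypothetical, vectors are plain Python lists, and the handling of the degenerate case $a = b$ (where the norm in the denominator is zero) is an assumption: it falls back to $\theta = 1/2$, which gives $z_{ij} = z_{ji} = a$.

```python
from math import sqrt


def z_update(a, b, c, rho):
    """Closed-form z-update for one edge.

    a, b : lists, a = x_i + u_ij and b = x_j + u_ji
    c    : lambda * w_ij
    rho  : ADMM penalty parameter
    Returns (z_ij, z_ji, theta).
    """
    diff_norm = sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    if diff_norm == 0.0:
        theta = 0.5  # assumption: a == b, both copies coincide
    else:
        theta = max(1.0 - c / (rho * diff_norm), 0.5)
    z_ij = [theta * ai + (1.0 - theta) * bi for ai, bi in zip(a, b)]
    z_ji = [(1.0 - theta) * ai + theta * bi for ai, bi in zip(a, b)]
    return z_ij, z_ji, theta


# Toy usage: ||a - b||_2 = 5, so theta = 1 - c / (5 * rho).
z_ij, z_ji, theta = z_update([3.0, 0.0], [0.0, 4.0], c=1.0, rho=1.0)

# With a large edge penalty the two copies collapse to the midpoint.
m_ij, m_ji, theta_merged = z_update([3.0, 0.0], [0.0, 4.0], c=10.0, rho=1.0)
```

In the first call the distance between the two copies is $\|a - b\|_2 - 2c/\rho = 3$, matching the value of $\mu$ from the derivation.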


























